SCIENTIFIC 

REPORTS 




OPEN 



SUBJECT AREAS: 

SYSTEMS ANALYSIS 

COMPUTATIONAL BIOLOGY AND 
BIOINFORMATICS 



Received 
7 November 2013 

Accepted 
10 March 2014 

Published 
26 March 2014 



Correspondence and 
requests for materials 
should be addressed to 
Z.Z.C. 
(zhongzhongchen@gmail.com) 
or K.C.W. 
(kcw@alumni.stanford.edu) 



GeneSense: a new approach for human 
gene annotation integrated with 
protein-protein interaction networks 

Zhongzhong Chen 1 , Tianhong Zhang 2 , Jun Lin 3 , Zidan Yan 4 , Yongren Wang 5 , Weiqiang Zheng 6 
& Kevin C. Weng 7 

1 BioMedSense Laboratory, Shanghai 200433, China, 2 Department of Otorhinolaryngology, Head and Neck Surgery, The First 
Affiliated Hospital, Harbin Medical University, Harbin 150081, China, 3 Department of Biotechnology, Guilin Medical University, 
Guilin 541004, China, 4 Shanghai Yingyun Biotech Inc., Shanghai 200433, China, 5 College of Basic Science, Zhejiang Chinese 
Medical University, Hangzhou 310053, China, 6 Department of Pathology, Changhai Hospital, Shanghai 200433, China, 7 Nplex 
Laboratory, San Jose, CA 95134, USA. 

Virtually all cellular functions involve protein-protein interactions (PPIs). As an increasing number of PPIs 
are identified and a vast amount of information accumulates, researchers are finding different ways to 
interrogate the data and understand the interactions in context. However, it is widely recognized that a 
significant portion of the data is scattered, redundant, not considered high quality, and not readily accessible 
to researchers in a systematic fashion. In addition, it is challenging to identify the optimal protein targets in 
current PPI networks. The GeneSense server was developed to integrate gene annotation and PPI 
networks in an expandable architecture that incorporates selected databases, with the aim of assembling, 
analyzing, evaluating and disseminating protein-protein association information in a comprehensive and 
user-friendly manner. Three network models, nodenet, leafnet and loopnet, are used to identify the 
optimal protein targets in complex networks. GeneSense is freely available at 
www.biomedsense.org/genesense.php. 

A cell can be viewed as an information processing system, receiving signals from its environment and its 
own internal state, interpreting these signals, and making appropriate cell-fate decisions 1 by regulating a 
network of interactions among the proteins encoded by its own genes. Interaction maps, rather than 
individual genes and proteins, provide insights into protein functions and are valuable in identifying ways to fight 
diseases 2. Large numbers of human protein-protein interactions (PPIs) have been reported through experimental 
techniques, manual curation of the literature, and numerous computational prediction methods 3. Protein-protein 
associations have proven to be an instrumental approach that led to the emergence of systematic and large-scale 
usage scenarios for functional association networks 4. Ideally, the complete set of associations is assembled into a 
large network that captures up-to-date knowledge of the functional modularity and interconnectivity in the 
cell. For example, PPIs have been used to interpret the results of genome-wide genetic screens 5 and functional 
genomics data 6,7, and to elucidate disease genes 8. Such an expanding knowledge base has the potential to improve 
the often time-consuming and cost-intensive process of biomedical analysis, and has become a major thrust in 
systems biology research. However, this information is widely scattered, and the rapid accumulation of data also 
makes it difficult to retrieve threads of information concurrently and correctly. The majority of public 
protein-protein interaction databases, such as IntAct 9, HPRD 10, MINT 11 and BioGRID 12, archive PPI records from 
literature curation or direct user submissions. Databases such as PINA 13, APID 14, STRING 4, MiMI 15 and UniHI 16 
integrate information from these curated PPI databases to provide comprehensive sets of public PPIs. For instance, 
the PINA database integrates six public PPI databases: IntAct 9, MINT 11, BioGRID 12, DIP 17, HPRD 10, 
and MIPS MPact 18. Each of these databases has its own unique features, with large variation in architectural 
design and annotation. Meanwhile, as the amount of public PPI data grows rapidly, these databases are heavily 
relied upon to facilitate studies of biological activities and to formulate hypotheses on protein functions and 
cellular processes. 

With the ever increasing importance of PPIs, the challenge researchers face at this point is to efficiently 
organize and retrieve useful information from the data, which raises the following questions. (i) Can the different 
data sources be integrated in order to gather a comprehensive set of information? A major imperfection across 
various databases is the implementation of multiple identification 
systems, depending on the applications the individual database was 
designed to support or on the developers' preferences. Although 
some databases, e.g., PINA, have attempted to integrate multiple public 
databases, the curated information represents only partial scientific 
information, or focuses on a specific subset of biological 
characteristics. For example, PINA uses the names p53 and c-Jun for 
genes whose approved symbols in HGNC (the HUGO Gene Nomenclature 
Committee) are TP53 and JUN, respectively, and such inconsistent 
names do not allow for future updates. A better way to integrate the existing PPI 
databases, eliminate redundancy, and prevent the compilation of 
inaccuracies is clearly needed. (ii) What are the methods to identify 
and reduce false-positive PPI data? Suspicion has been raised about 
the quality and reliability of protein interaction data as the size of 
available PPI databases increases. There are two distinct classes of 
false positives: one is biological false positives, in which the 
interactions can be confirmed by multiple computational methods but the 
two proteins are in fact never present in the same cell or subcellular 
compartment at the same time; the other is technical false positives, 
which can occur in any experimental system 19. Both computational and 
experimental methods for identifying PPIs generate false positives to 
some extent. (iii) How can the best associated proteins be identified for 
in-depth query and research? Cellular functions are often critically 
dependent on the correct assembly of proteins into functional 
multi-protein complexes through dynamic interactions of various 
components in response to signals from internal cellular demands 
or from a cell's external environment 20. For example, the PPI network of the 
tumor suppressor kinase LKB1 and its 14 substrate kinases, consisting 
of 131 proteins and 203 interactions, provides hypotheses on the links 
and pathways critical for tumorigenesis following LKB1 deficiency 13. 
However, it is difficult to identify the appropriate LKB1 target genes 
from the complex network. 

To address these challenges, we developed a web-based platform 
called GeneSense with the following three main objectives: (i) to 
provide gene annotation and integrate different data sources based 
on HGNC, in which all genes are manually curated and the HGNC 
symbols and names assigned represent a standard that is acceptable for 
use in all publications and databases where a specific gene is 
discussed or referenced 21; (ii) to build PPI networks based on 
literature and experimental data without the false positives; and (iii) to 
build a user-friendly tool comprising nodenet (node network), 
leafnet (leaf network) and loopnet (loop network) to assist efficient 
identification of regulatory factors. 

Results 

GeneSense was developed to support and integrate gene annotation 
and protein-level network analysis. The goal of the GeneSense team 
is to provide a friendly, intuitive user interface and a clear presentation 
of the results. GeneSense requires a JavaScript-enabled browser, such 
as Google Chrome or Internet Explorer. It allows users to enter the 
database via a gene of interest using its approved symbol, alias names, 
approved name or descriptions. Once users submit the gene of 
interest, they retrieve the gene's descriptions and are informed of 
similarly described genes. Subsequently, users can choose to 
continue to a gene summary page (Fig. 1) or abort the process 
and return to the data entry page. The results page is divided into five 
main sections: a search button to search for a new gene of interest, the 
gene summary (Fig. 1), the node network for the gene of interest 
(Fig. 2A), the leaf network for the gene of interest (Fig. 2B), and the 
loop network for the gene of interest (Fig. 2C). In the PPI network 
section, a JavaScript applet launches and the networks are 
displayed. 

Application to gene annotation. The summary web page (Fig. 1) 
displays the general information of the queried gene, its homolog 
information, clinical information, gene information, reference 
information, pathway information, and protein-protein interaction 
information. The general information, such as the approved symbol 
and name, is mainly based on HGNC 21 and complemented by 
UniProt 22, which provides a richly and accurately annotated protein 
sequence knowledgebase. Biologists studying a human gene 
often wish to transfer functional information between species, and 
homolog information helps to elucidate how the gene is related 
to other genes in a family, as demonstrated in TreeFam 23, 
MGI 24, RGD 25, and HCOP 26. Other databases such as GeneTests 27, 
UCSC 28, CiteXplore (www.ebi.ac.uk/citexplore) and GeneCards 29, as well as 
pathway information, are also linked to GeneSense. Gene 
information is based on gene definitions from HGNC 21 and related links via 
both HGNC-curated data and mapped data provided by the external 
databases. A group of homology-related links, including TreeFam 23, 
Mouse Genome Informatics (MGI) 24, the Rat Genome Database (RGD) 25, 
and the HGNC Comparison of Orthology Predictions (HCOP) 26, are used 
to specify the homolog information in GeneSense. Clinical 
information links include GeneTests 27, DECIPHER 30, COSMIC 31, and 
OMIM (http://omim.org/). Four widely used gene and genome 
browsers, Entrez Gene 32, Ensembl 33, UCSC 28 and Vega 34, are also linked 
in GeneSense. PubMed 35 and CiteXplore (www.ebi.ac.uk/citexplore) 
hyperlinks are included in the references to provide active links to 
articles that first described the gene in question or that are 
particularly relevant to the nomenclature of the gene. Additional links 
such as GeneCards 29, GENATLAS 36, GOPubmed 37 and H-InvDB 38 
are included in GeneSense based on HGNC. KEGG 39 information is 
used for pathway analysis in GeneSense. The threads of basic 
protein-protein interaction information fetched from different data 
sources are also listed in the summary, and the associated proteins can 
be clicked on to retrieve the corresponding gene summary 
information. 

Application to protein-protein interaction networks. PPI 
databases in GeneSense were integrated using IPI 40, which maps a variety 
of accession numbers from different databases; these were subsequently unified 
to HGNC accession numbers. The result is a non-redundant database 
based on the integration of data from IntAct 9, MINT 11, HPRD 10 and 
other databases, such as MEROPS 41, that can be integrated by IPI. 
The architecture of GeneSense, based on HGNC and various types of 
web services, offers the great advantage of being easily expandable with 
different PPI data sources. The network visualization is used to 
evaluate the regulatory relationship between the queried protein 
and associated proteins, as in the network analysis of the MAPK8 
gene in Fig. 2. The nodenet of the MAPK8 gene (Fig. 2A) shows the 
interactions of MAPK8 and 44 downstream proteins. The leafnet 
model was further used to evaluate the interactions among the downstream 
proteins (Fig. 2B). A regulatory network can exist under the 
identified post-transcriptional modifications in either of two stable 
states ('upstream' or 'downstream'). The loopnet model (Fig. 2C) 
shows the visualization of the MAPK8 PPI network, including 
downstream and upstream proteins, which may contribute to the 
understanding of the mediated communication between interacting proteins. 
GeneSense can also be used to analyze larger complex networks of 
PPIs, such as the SRC PPI network (Supplementary Fig. S1). 

Discussion 

Figure 1 | A screen shot of the gene MAPK8 information summary page in GeneSense. The table is divided into sections that show MAPK8's general 
information, homolog information, clinical information, gene information, reference information, pathway information, and protein-protein 
interactions, respectively, with links to additional information. 

Most public PPI databases adopt diverse practices to annotate gene 
and protein-protein interaction information. These databases gather 
partial scientific information that is available, or focus on a specific 
subset of biological characteristics. The inconsistent names used 
in these databases often do not allow for later updates or 
correction of gene annotation and PPI integration from validated 
external sources. For example, c-Jun, whose approved symbol 
is JUN in GeneSense and HGNC, also has another synonym, 
AP-1, in HGNC 21; the use of c-Jun in PINA does not allow for 
straightforward update or correction for network analysis with 
references to a variety of external resources 13,42, because the synonym 
AP-1 is not considered. Furthermore, inconsistent names lead to the use 
of only partial scientific information. Han et al. 43 found that JNK (also 
named MAPK8) plays a key role in the metabolic response to obesity, 
whereas Pal and coworkers showed that activation of JNK1 (also named 
MAPK8) does not account for the major diet-induced 
effects in another experiment 44. The discrepancy between 
the studies is a result of the lack of uniform nomenclature 
and of asymmetric information. GeneSense prevents the use of 
partial scientific information by using MAPK8 to unify the 
inconsistent names. In GeneSense, the primary identifier for each record is 
the approved and updated gene symbol, which is an acronym or 
abbreviation of the associated gene name based on HGNC 21. As a 
result, inconsistent names are unified and partial access to existing 
data is prevented. Assigning each entry a unique 'HGNC ID' 21 also 
enables easy data tracking regardless of updates in the nomenclature 
of any given entry. On the basis of the unified gene 
symbol, different databases, such as IntAct 9, MINT 11, 
HPRD 10, MEROPS 41 and other databases, can be integrated into 
GeneSense. 
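
The effect of unifying names under an approved symbol can be illustrated with a minimal Python sketch. It is not the GeneSense implementation: the registry is a hand-written toy containing only names mentioned in this section, the identifiers `ID-0001` to `ID-0003` are placeholders rather than real HGNC IDs, and `build_index`, `resolve` and `group_by_symbol` are hypothetical helper names.

```python
# Toy registry keyed by a stable identifier. The IDs are placeholders,
# NOT real HGNC IDs; symbols and aliases are those named in the text.
registry = {
    "ID-0001": {"symbol": "MAPK8", "aliases": {"JNK", "JNK1"}},
    "ID-0002": {"symbol": "JUN", "aliases": {"c-Jun", "AP-1"}},
    "ID-0003": {"symbol": "TP53", "aliases": {"p53"}},
}


def build_index(registry):
    """Map every symbol and alias (case-insensitive) to its stable ID."""
    index = {}
    for gene_id, entry in registry.items():
        for name in {entry["symbol"], *entry["aliases"]}:
            index[name.lower()] = gene_id
    return index


def resolve(name, registry, index):
    """Return the approved symbol for a symbol or alias, or None if unknown."""
    gene_id = index.get(name.strip().lower())
    return registry[gene_id]["symbol"] if gene_id else None


def group_by_symbol(findings, registry, index):
    """Pool (name, note) pairs under the approved symbol of each name."""
    grouped = {}
    for name, note in findings:
        symbol = resolve(name, registry, index)
        # Unknown names are kept under their original spelling.
        grouped.setdefault(symbol or name, []).append(note)
    return grouped


index = build_index(registry)

# Two findings reported under different names for the same gene.
findings = [
    ("JNK", "key role in the metabolic response to obesity (ref. 43)"),
    ("JNK1", "activation does not account for major diet-induced effects (ref. 44)"),
]
grouped = group_by_symbol(findings, registry, index)
print(list(grouped))          # both findings are pooled under one symbol
print(len(grouped["MAPK8"]))
```

Because lookups go through the stable identifier rather than the display name, a later change to an entry's symbol in this sketch only requires rebuilding the index; records attached to the identifier are unaffected.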

GeneSense is also dedicated to the visualization of PPI networks of 
the coded proteins based on HGNC, IPI and PPI databases. 
Visualization can be greatly enhanced by interactive presentations 
and animation; however, high-level abstractions may limit a 
developer's ability to execute fast incremental scene changes if the system 
lacks the information necessary to avoid redundant computation. To 
address this problem, GeneSense uses Data-Driven 
Documents (D3), which results in significantly faster 
page loads: twice as fast as Protovis and over three times as fast as 
Flash. Nodenet, leafnet and loopnet were built on D3. The 
nodenet model can be useful in highlighting understudied molecular 
interactions of proteins. For example, the nodenet model shows 
the interactions of MAPK8 and downstream proteins (Fig. 2A), and 
it may guide the formulation of meaningful hypotheses with regard 
to signaling pathways critical to tumorigenesis following MAPK8 
deficiency. The leafnet model helps to identify specific proteins that 
regulate the genes or proteins of interest by means of the leaf networks. The 
leafnet (Fig. 2B) showed that some downstream proteins, such as 
MAPK1 and JUN, which have many interactions with other downstream 
proteins, may be involved in important yet complex mechanisms in 
MAPK8-related signaling pathways; other downstream proteins, 
such as REL and GSTP1, which do not show much interaction with 
other downstream proteins, may have a simple yet unique function 
with MAPK8. The loopnet model (Fig. 2C) can be used to assist the 
design of experiments that aim to distinguish between alternative 
mechanisms involved in the complex networks; for example, experiments 
on the upstream protein MAP3K7 and the downstream protein REL of 
MAPK8 can be designed to regulate MAPK8 in different ways and so 
reveal bistable regulatory mechanisms. Moreover, a 
force-directed layout algorithm 45 and D3 were applied to visualize large 
complex networks in GeneSense, and they make the analysis of 
complex disease-associated genes relatively easy. To give another 
example, SRC kinase is a common signaling node in trastuzumab 
resistance caused by different mechanisms in HER2-positive breast 
cancers 46. Our previous study showed that an intrinsic 40-gene set 
can be used to classify breast cancer subtypes and assist in 
optimizing therapeutic management 47; however, the association between 
SRC and the intrinsic characteristic genes is unknown. Using 
GeneSense, two intrinsic genes, ESR1 and ERBB2, were identified 
as SRC downstream genes (Supplementary Fig. S1). Understanding 
the complex ways in which SRC interacts with its downstream genes ESR1 
and ERBB2 in specific breast cancer subtypes may be crucial for 
discovering and analyzing mechanisms involved in trastuzumab 
resistance. 

Figure 2 | Network analysis of MAPK8 protein. The green circle indicates the node protein. The purple circles indicate leaf proteins, and the orange lines 
indicate interactions. (A) The node network of MAPK8 protein, showing the interactions of MAPK8 and 44 downstream proteins. (B) The leaf 
network of MAPK8 protein, used to evaluate the interactions among downstream proteins. (C) The loop network of MAPK8 
protein, showing the MAPK8 PPI network including downstream and upstream proteins, which helps 
researchers to understand the mediated communication between interacting proteins. 

In practice, GeneSense aims to frame complicated PPI 
networks in precise terms and to use computer simulations, supported by 
gene annotation, to derive implications about how the networks function 
in normal cells and malfunction in diseased cells. The 
following outcomes can be expected from using GeneSense: (i) to 
gain an accurate overview of information on the genes of interest; (ii) to 
build different models to highlight understudied molecular 
interactions of proteins coded by user-entered genes. 

Figure 3 | Schematics of the GeneSense platform. Gene annotation databases (top left, purple) and PPI databases (top right, purple) are integrated into 
GeneSense, which allows users to query genes of interest and analyze them with pre-constructed networks, including the node network, leaf network and loop 
network. 



Methods 

Distributed architecture and data sources. GeneSense is a web-based platform that 
allows users to visualize, manipulate and analyze gene information and to find the 
optimal gene regulatory factors through the corresponding protein networks. The GeneSense 
database contains two parts (Fig. 3): the first part consists of gene annotation and the 
second part consists of the PPI database. The gene annotation part includes homolog 
information, clinical information, and gene-related information from HGNC 21. 
Pathway information from KEGG 39 was also integrated into GeneSense by transferring 
identical records based on HGNC. PPI databases in GeneSense were integrated using the 
International Protein Index (IPI) 40, which maps a variety of accession numbers from 
different databases; these were subsequently unified to HGNC accession numbers. The result is a 
non-redundant database based on the integration of data from IntAct 9, MINT 11, HPRD 10 
and other databases, such as MEROPS 41, that can be integrated by IPI. Interactions 
and protein information were integrated into GeneSense on the assumption that two proteins 
from different databases are the same if they have the same IPI accession. With 
reference to IPI, GeneSense merges results from data sources that employ different 
but compatible identifier systems. Unique PPI records in different databases were 
identified by IPI to gather a comprehensive and non-redundant protein-protein 
interaction dataset, and the protein names were subsequently unified based on 
HGNC 21 to offer consistent names and non-redundant data sets of PPI information. 
In-house gene information and in-house PPI databases include scattered data that are 
not included in the existing databases and would be integrated into GeneSense 
manually. When users query the genes of interest in GeneSense, the integrated 
information is retrieved and presented in the gene summary part, including the 
downstream and upstream proteins according to the post-transcriptional 
modification events. Furthermore, among the key features of GeneSense, three 
different network models were developed to analyze the function of proteins coded by 
the retrieved genes: the node network (nodenet) is used to observe the interactions of 
associated downstream proteins with the target proteins; the leaf network (leafnet) is 
used to calculate the complexity of associated downstream proteins with one another 
and to assist in the identification of probable regulatory factors; the loop network (loopnet) 
is used to provide an overview of the upstream and downstream relationships of 
associated proteins with the target proteins. The architecture of GeneSense, based on 
HGNC and various types of web services, offers the great advantage of being easily 
extendable with different PPI data sources. 
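
The integration steps described above (map source-specific identifiers to a shared accession, unify to approved symbols, then collapse duplicate pairs) can be sketched in Python as follows. This is an illustration under stated assumptions, not the GeneSense pipeline: the source names `db1` and `db2`, the identifiers, the accessions `ACC_1` to `ACC_3`, the reference labels and the function `integrate` are all invented placeholders, and the two lookup tables merely stand in for the roles that IPI and HGNC play in the text.

```python
from collections import defaultdict

# All identifiers below are invented placeholders for illustration.
# Step 1: each source uses its own protein identifiers; a cross-reference
# table (the role IPI plays in the text) maps them to a shared accession.
to_accession = {
    ("db1", "P_001"): "ACC_1",
    ("db2", "x17"): "ACC_1",   # same protein as db1:P_001
    ("db1", "P_002"): "ACC_2",
    ("db2", "x42"): "ACC_2",   # same protein as db1:P_002
    ("db2", "x99"): "ACC_3",
}

# Step 2: shared accessions are unified to approved gene symbols
# (the role HGNC plays in the text).
to_symbol = {"ACC_1": "MAPK8", "ACC_2": "JUN", "ACC_3": "REL"}

records = [
    {"source": "db1", "a": "P_001", "b": "P_002", "ref": "ref-A"},
    {"source": "db2", "a": "x42", "b": "x17", "ref": "ref-B"},  # duplicate pair
    {"source": "db2", "a": "x17", "b": "x99", "ref": "ref-C"},
]


def integrate(records, to_accession, to_symbol):
    """Merge PPI records from several sources into one non-redundant set.

    Returns {(symbol_1, symbol_2): sorted list of supporting references}.
    Records whose identifiers cannot be mapped are skipped.
    """
    merged = defaultdict(set)
    for rec in records:
        acc_a = to_accession.get((rec["source"], rec["a"]))
        acc_b = to_accession.get((rec["source"], rec["b"]))
        if acc_a is None or acc_b is None:
            continue
        sym_a, sym_b = to_symbol.get(acc_a), to_symbol.get(acc_b)
        if sym_a is None or sym_b is None:
            continue
        # Sorting makes the key independent of the order the pair was reported in.
        pair = tuple(sorted((sym_a, sym_b)))
        merged[pair].add(rec["ref"])
    return {pair: sorted(refs) for pair, refs in merged.items()}


result = integrate(records, to_accession, to_symbol)
for pair, refs in sorted(result.items()):
    print(pair, refs)
```

Keeping the references attached to each merged pair preserves the link between an edge and its supporting evidence after redundancy is removed.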

Network construction and implementation. In GeneSense, the queried protein (node 
protein) is represented by the central green node, and interacting proteins (leaf 
proteins) are represented by purple nodes. A node can be dragged around to change 
the arrangement of the nodes. Edges are the connections between nodes, and each 
edge is associated with the reference corresponding to the interaction. GeneSense 
adopts a number of methods to annotate protein-protein interactions. First, nodenet 
in GeneSense supports basic queries of the PPI network for a single protein, which can be 
used to rapidly verify whether in-lab generated PPIs are already in the public domain 
or are potentially novel. Second, GeneSense provides the leafnet network model to 
visualize the complexity of the queried protein and its substrate proteins, which can be 
used to find the unique or optimal substrate proteins. Third, GeneSense provides 
loopnet to visualize the upstream and downstream targets of the queried protein, 
which reveals biological events in cells at the protein-protein interaction level. 
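
One way to picture the three views is as different selections from a single directed edge list. The Python sketch below is a simplified reading of the descriptions above, not the GeneSense code: the protein names are taken from the MAPK8 example in this paper, but the specific edges (in particular the MAPK1 to JUN leaf-leaf edge) are illustrative, and the functions `nodenet`, `leafnet` and `loopnet` are hypothetical helpers that reuse the model names only for readability.

```python
# Toy directed interactions (regulator -> target). Protein names come from
# the MAPK8 example in the text; the specific edges are illustrative only.
edges = [
    ("MAP3K7", "MAPK8"),  # upstream of the queried protein
    ("MAPK8", "JUN"),
    ("MAPK8", "MAPK1"),
    ("MAPK8", "REL"),
    ("MAPK8", "GSTP1"),
    ("MAPK1", "JUN"),     # leaf-leaf interaction (assumed for the example)
]


def nodenet(query, edges):
    """Edges from the queried (node) protein to its downstream (leaf) proteins."""
    return [(a, b) for a, b in edges if a == query]


def leafnet(query, edges):
    """Interactions among the downstream proteins, plus a per-leaf link count."""
    leaves = {b for _, b in nodenet(query, edges)}
    links = [(a, b) for a, b in edges if a in leaves and b in leaves]
    degree = {leaf: 0 for leaf in leaves}
    for a, b in links:
        degree[a] += 1
        degree[b] += 1
    return links, degree


def loopnet(query, edges):
    """Upstream and downstream neighbours of the queried protein."""
    upstream = sorted(a for a, b in edges if b == query)
    downstream = sorted(b for a, b in edges if a == query)
    return {"upstream": upstream, "downstream": downstream}


print(nodenet("MAPK8", edges))
links, degree = leafnet("MAPK8", edges)
print(links, sorted(degree.items()))
print(loopnet("MAPK8", edges))
```

In this toy data the leaf-leaf link count plays the role of the leaf "complexity": JUN and MAPK1 each have one link to another leaf, whereas REL and GSTP1 have none.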

The GeneSense platform runs on a Linux server and uses Data-Driven Documents 
(D3), an embedded domain-specific language for transforming the document object 
model (DOM) based on data. The DOM is manipulated through a combination of technologies, 
mainly CSS for aesthetics, PHP for page content, JavaScript for interaction, and SVG for 
vector graphics. A force-directed algorithm 45 and D3 were used to generate 
graphs and to determine the position of each node. Each node is subject to a repulsive 
force from every other node, yet is constrained by the edges that keep nodes connected 
together. This results in a flexible layout that appears inviting as it unfolds, as exemplified 
by the nodenet, which displays the queried protein (or node) and its 
interacting proteins (or leaves). Although the nodenet model appears to be a promising 
way to display datasets of queried and interacting proteins, it does not describe 
leaf-leaf relationships and their degrees of influence. The leafnet was built based on 
the iteration process of the force-directed algorithm for each leaf. The leafnet model is 
constructed in such a way that high complexity corresponds to layouts in which 
adjacent leaves are close to each other and non-adjacent leaves are well spaced. 
High-complexity leaves may play a crucial role in the signaling network, while 
low-complexity leaves may participate in the regulation of the node in a relatively simpler 
way. With the aim of visualizing the upstream or downstream proteins, modifications to 
the basic nodenet model were made, and the loopnet model, which adds directions to 
the nodenet model, was built. Loopnet reflects the upstream and downstream events 
involved in the post-transcriptional modifications. 
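
The balance of forces described above (every node repels every other node, while edges pull connected nodes together) can be sketched with a generic force-directed iteration. The Python code below is a textbook-style illustration in the spirit of Fruchterman and Reingold's scheme; it is an assumption that this resembles the cited algorithm, and it is neither D3's force layout nor the code used by GeneSense. The function name `force_layout`, the force laws, the constant `k`, the starting temperature and the cooling factor are all arbitrary choices made for the example.

```python
import math
import random


def force_layout(nodes, edges, iterations=300, k=1.0, seed=0):
    """Return {node: (x, y)} from a simple force-directed layout.

    Every pair of nodes repels with force k^2 / d; every edge pulls its
    endpoints together with force d^2 / k, where d is the current distance.
    """
    rng = random.Random(seed)  # seeded so the layout is reproducible
    pos = {n: [rng.uniform(-1, 1), rng.uniform(-1, 1)] for n in nodes}
    temperature = 0.5  # maximum distance a node may move in one iteration
    for _ in range(iterations):
        disp = {n: [0.0, 0.0] for n in nodes}
        # Repulsion between every pair of nodes.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                d = math.hypot(dx, dy) or 1e-9  # guard against zero distance
                f = k * k / d
                disp[a][0] += f * dx / d
                disp[a][1] += f * dy / d
                disp[b][0] -= f * dx / d
                disp[b][1] -= f * dy / d
        # Attraction along edges.
        for a, b in edges:
            dx = pos[a][0] - pos[b][0]
            dy = pos[a][1] - pos[b][1]
            d = math.hypot(dx, dy) or 1e-9
            f = d * d / k
            disp[a][0] -= f * dx / d
            disp[a][1] -= f * dy / d
            disp[b][0] += f * dx / d
            disp[b][1] += f * dy / d
        # Move each node along its net force, capped by the temperature.
        for n in nodes:
            length = math.hypot(disp[n][0], disp[n][1]) or 1e-9
            step = min(length, temperature)
            pos[n][0] += disp[n][0] / length * step
            pos[n][1] += disp[n][1] / length * step
        temperature *= 0.97  # cool down so the layout settles
    return {n: (x, y) for n, (x, y) in pos.items()}


# A small star: the queried protein with three leaves (names from the text).
nodes = ["MAPK8", "JUN", "MAPK1", "REL"]
edges = [("MAPK8", "JUN"), ("MAPK8", "MAPK1"), ("MAPK8", "REL")]
layout = force_layout(nodes, edges)
for name, (x, y) in layout.items():
    print(f"{name}: ({x:.2f}, {y:.2f})")
```

Capping each move by a decreasing temperature is one common way to keep such iterations from oscillating; interactive tools typically rerun the same kind of update when a node is dragged.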